
Whitespace and comment tokens #387

Merged
dlang-bot merged 5 commits into dlang-community:master from WebFreak001:whitespace-and-comment-tokens
Apr 9, 2020

Conversation

@WebFreak001
Member

WebFreak001 commented Apr 8, 2020

This replaces the previous comment and trailingComment properties. It maintains full backwards compatibility; only some @nogc code may now fail to compile if it used the comment or trailingComment properties from Token.

Creates a new trivia.d file, which contains helpers for dealing with trivia, and moves the previous comment helper from lexer.d into this file.

Also simplifies the getTokensForParser implementation.

I suggest reviewing each commit separately; they are strictly separated by functionality.

Commit "add unittest for current comment behavior" (15e117a) just adds unit tests for the previous behavior; they pass as-is.
Commit "separate comment helpers into dparse.trivia" (afe19b2) moves the comment helper structs to dparse.trivia and also copies an internal function and enum that were previously inside the getTokensForParser function.
Commit "make token attributes not apply to extraFields" (4d93669) moves the extraFields mixin outside the @safe nothrow pure @nogc block so it can decide on the attributes itself; this may introduce changes in other software using std.experimental.lexer, as it weakens attributes.
Commit "attach whitespace & comment tokens to tokens" (db2a669) does the actual changes: it removes comment parsing from the getTokensForParser function and instead appends trivia (almost the same logic, just simpler), adds compatibility getters to the Token struct, and unit-tests a big block of trivia-heavy code (see the usage sketch below).

Supersedes #149
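For illustration, a minimal usage sketch of the new trivia attachment from the consumer side. The leadingTrivia field and the comment compatibility getter are the ones described above; the rest is ordinary libdparse lexing boilerplate, not code from this PR:

```d
// Minimal usage sketch (assumptions: the leadingTrivia/trailingTrivia fields
// and the comment compatibility getter described in this PR).
import std.stdio : writeln;
import dparse.lexer;

void main()
{
    string source = "// a leading comment\nint x; // a trailing comment\n";
    StringCache cache = StringCache(StringCache.defaultBucketCount);
    LexerConfig config;
    auto tokens = getTokensForParser(cast(ubyte[]) source, config, &cache);

    foreach (token; tokens)
    {
        // New: whitespace and comment tokens are attached to each token as trivia.
        foreach (trivia; token.leadingTrivia)
            writeln("leading trivia: ", trivia.text);

        // Old: the comment property still works, but is now a (non-@nogc) getter.
        if (token.comment.length)
            writeln("comment: ", token.comment);
    }
}
```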

For us this only means the @nogc attribute is no longer applied there. This allows more control within TokenStructure aliases, for example to create GC-dependent functions.
WebFreak001 force-pushed the whitespace-and-comment-tokens branch from 081566a to db2a669 on April 8, 2020 14:02
@WebFreak001
Member Author

Also a nice change that was introduced alongside: error messages containing the Token type are now much more readable:
src/dparse/trivia.d(516,5): Error: mutable method dparse.trivia.extractTrailingDdoc.FilterResult!(__lambda2, const(TokenStructure!(ubyte, "import dparse.lexer:TokenTriviaFields; mixin TokenTriviaFields;"))[]).FilterResult.empty is not callable using a const object
vs
src/dparse/trivia.d(516,5): Error: mutable method dparse.trivia.extractTrailingDdoc.FilterResult!(__lambda2, const(TokenStructure!(ubyte, "\x0a string comment;\x0a string trailingComment;\x0a\x0a int opCmp(size_t i) const pure nothrow @safe {\x0a if (index < i) return -1;\x0a if (index > i) return 1;\x0a return 0;\x0a }\x0a\x0a int opCmp(ref const typeof(this) other)const pure nothrow @safe {\x0a return opCmp(other.index);\x0a }\x0a"))[]).FilterResult.empty is not callable using a const object

@Hackerpilot
Collaborator

Hackerpilot commented Apr 9, 2020

I think the dparse_src variable in meson.build needs to be updated.

@codecov

codecov bot commented Apr 9, 2020

Codecov Report

Merging #387 into master will increase coverage by 0.55%.
The diff coverage is 99.23%.


@@            Coverage Diff             @@
##           master     #387      +/-   ##
==========================================
+ Coverage   75.53%   76.09%   +0.55%     
==========================================
  Files           8        9       +1     
  Lines        6985     7123     +138     
==========================================
+ Hits         5276     5420     +144     
+ Misses       1709     1703       -6     
Impacted Files Coverage Δ
src/std/experimental/lexer.d 84.78% <90.00%> (ø)
src/dparse/trivia.d 99.40% <99.40%> (ø)
src/dparse/lexer.d 83.40% <100.00%> (-2.15%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 1c6c21d...14d5e3c.

@Hackerpilot
Collaborator

Hackerpilot commented Apr 9, 2020

When we tag a release with these changes we should change the version number to v0.15.0 or v0.15.0-beta.1.

@RazvanN7
Contributor

RazvanN7 commented Jun 15, 2021

I was trying to update phobos to use the latest dscanner [1], which in turn was updated to use the latest libdparse, and I stumbled into the same failure as @MoonlightSentinel. I lost a few hours investigating the failure and ended up at this same PR.

How can we fix this issue? It is currently blocking the phobos PR queue.

Also, we should be testing new additions against phobos (via dscanner) just to make sure that future additions do not break it.

[1] dlang/phobos#8136

@RazvanN7
Contributor

@WebFreak001 any ideas for a fix?

@Geod24
Member

Geod24 commented Jun 15, 2021

So the exception itself is legit, because for some reason one of the comments is garbage:

(gdb) frame 10
#10 0x000055a2e48a1ce8 in _D8dscanner8analysis18has_public_example21HasPublicExampleCheck__T8hasDittoTC6dparse3ast19FunctionDeclarationZQBsMFxCQBoQBkQBjZb (this=0x7f8fe5a7a200, decl=0x55a2e58e4da0)
    at src/dscanner/analysis/has_public_example.d:97
97			return parseComment(decl.comment, null).isDitto;
(gdb) p *decl
$2 = {hasAuto = false, hasRef = false, returnType = 0x0, name = {text = "castSwitch", line = 240, column = 6,
    index = 7400, type = 65 'A', leadingTrivia = 0x0, trailingTrivia = 0x0, memoizedLeadingComment = 0x0,
    memoizedTrailingComment = 0x0}, templateParameters = 0x55a2e58e4f80, parameters = 0x55a2e58e50f0,
  constraint = 0x0, functionBody = 0x55a2e58e5410, memberFunctionAttributes = 0x0,
  comment = "\343,\000\000\345,\000\000\354,\000\000\355,\000\000\356,\000\000\357,\000\000\363,\000\000\364,\000\000\000-\000\000&-\000\000'-\000\000(-\000\000--\000\000.-\000\000 \246\000\000*\246\000\000A\246\000\000B\246\000\000C\246\000\000D\246\000\000E\246\000\000F\246\000\000G\246\000\000H\246\000\000I\246\000\000J\246\000\000K\246\000\000L\246\000\000M\246\000\000N\246\000\000O\246\000\000P\246\000\000Q\246\000\000R\246\000\000S\246\000\000T\246\000\000U\246\000\000V\246\000\000W\246\000\000X\246\000\000Y\246\000\000Z\246\000\000[\246\000\000\\\246\000\000]\246\000\000^\246\000\000_\246\000\000`\246\000\000a\246\000\000b\246\000\000"..., attributes = 0x0, storageClasses = {0x55a2e58e4eb0}}

Removing code / comments makes the location move. It reeks of memory corruption.
Dumping the data to binary gives:

[227, 44, 0, 0, 229, 44, 0, 0, 236, 44, 0, 0, 237, 44, 0, 0, 238, 44, 0, 0, 239, 44, 0, 0, 243, 44, 0, 0, 244, 44, 0, 0, 0, 45, 0, 0, 38, 45, 0, 0, 39, 45, 0, 0, 40, 45, 0, 0, 45, 45, 0, 0, 46, 45, 0, 0, 32, 166, 0, 0, 42, 166, 0, 0, 65, 166, 0, 0, 66, 166, 0, 0, 67, 166, 0, 0, 68, 166, 0, 0, 69, 166, 0, 0, 70, 166, 0, 0, 71, 166, 0, 0, 72, 166, 0, 0, 73, 166, 0, 0, 74, 166, 0, 0, 75, 166, 0, 0, 76, 166, 0, 0, 77, 166, 0, 0, 78, 166, 0, 0, 79, 166, 0, 0, 80, 166, 0, 0, 81, 166, 0, 0, 82, 166, 0, 0, 83, 166, 0, 0, 84, 166, 0, 0, 85, 166, 0, 0, 86, 166, 0, 0, 87, 166, 0, 0, 88, 166, 0, 0, 89, 166, 0, 0, 90, 166, 0, 0, 91, 166, 0, 0, 92, 166, 0, 0, 93, 166, 0, 0, 94, 166, 0, 0, 95, 166, 0, 0, 96, 166, 0, 0, 97, 166, 0, 0, 98, 166, 0, 0, 99, 166, 0, 0, 100, 166, 0, 0, 101, 166, 0, 0, 102, 166, 0, 0, 103, 166, 0, 0, 104, 166, 0, 0, 105, 166, 0, 0, 106, 166, 0, 0, 107, 166, 0, 0, 108, 166, 0, 0, 109, .... ]

Which is either a 4-byte pattern or uints:

[11491, 11493, 11500, 11501, 11502, 11503, 11507, 11508, 11520, 11558, 11559, 11560, 11565, 11566, 42528, 42538, 42561, 42562, 42563, 42564, 42565, 42566, 42567, 42568, 42569, 42570, 42571, 42572, 42573, 42574, 42575, 42576, 42577, 42578, 42579, 42580, 42581, 42582, 42583, 42584, 42585, 42586, 42587, 42588, 42589, 42590, 42591, 42592, 42593, 42594, 42595, 42596, 42597, 42598, 42599, 42600, 42601, 42602, 42603, 42604, 42605, 42606, 42625, 42626, 42627, 42628, 42629, 42630, 42631, 42632, 42633, 42634, 42635, 42636, 42637, 42638, 42639, 42640, 42641, 42642, 42643, 42644, 42645, 42646, 42647, 42648, 42787, 42788, 42789, 42790, 42791, 42792, 42793, 42794, 42795, 42796, 42797, 42798, 42799, 42802, 42803, 42804, 42805, 42806, 42807, 42808, 42809, 42810, 42811, 42812, 42813, 42814, 42815, 42816, 42817, 42818, 42819, 42820, 42821, 42822, 42823, 42824, 42825, 42826, 42827, 42828, 42829, 42830, 42831, 42832, 42833, 42834, 42835, 42836, 42837, 42838, 42839, 42840, 42841, 42842, 42843, 42844, 42845, 42846, 42847, 42848, 42849, 42850, 42851, 42852, 42853, 42854, 42855, 42856, 42857, 42858, 42859, 42860, 42861, 42862, 42863, 42864, 42865, 42873, 42874, 42875, 42876, 42877, 42879, 42880, 42881, 42882, 42883, 42884, 42885, 42886, 42887, 42888, 42892, 42893, 42894, 42895, 42897, 42898, 42899, 42900, 42913, 42914, 42915, 42916, 42917, 42918, 42919, 42920, 42921, 42922, 43002, 43003, 43216, 43226, 43264, 43274, 43472, 43482, 43600, 43610, 44016, 44026, 64256, 64263, 64275, 64280, 65296, 65306, 65345, 65371, 66600, 66640, 66720, 66730, 69734, 69744, 69872, 69882, 69942, 69952, 70096, 70106, 71360, 71370, 119834, 119860, 119886, 119893, 119894, 119912, 119938, 119964, 119990, 119994, 119995, 119996, 119997, 120004, 120005, 120016, 120042, 120068, 120094, 120120, 120146, 120172, 120198, 120224, 120250, 120276, 120302, 120328, 120354, 120380, 120406, 120432, 120458, 120486, 120514, 120539, 120540, 120546, 120572, 120597, 120598]

This doesn't evoke anything obvious to me (perhaps indexes in the file?) but there is a clear pattern.
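For reference, the uint listing above is just the same byte dump read as little-endian 32-bit values; a minimal sketch of that reinterpretation (sample data truncated):

```d
// Minimal sketch: reinterpret the dumped bytes as little-endian uints,
// which is how the second listing above was produced (sample data truncated).
import std.stdio : writeln;

void main()
{
    ubyte[] raw = [227, 44, 0, 0, 229, 44, 0, 0, 236, 44, 0, 0, 237, 44, 0, 0];
    // On a little-endian machine, a plain array cast gives the uint view.
    auto values = cast(const(uint)[]) raw;
    writeln(values); // [11491, 11493, 11500, 11501]
}
```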

BTW, while looking at the code I noticed this change:
db2a669#diff-073a681f37b7ac58a75cb15be50882a5075a2ebcaa53efc4bd97cffa037ab6ddR472-R475
Notably, the char[] => Token[] change. The data is duped, but Token has indirections (it contains strings); if any of those strings is not properly allocated, it could be a very bad idea. It might be safe, but since I was looking for memory corruption, this one stood out.

@RazvanN7
Contributor

@WebFreak001 could you provide some assistance?

@WebFreak001
Member Author

I think what Mathias said is already going to be the issue. As this only happens with huge files, it's probably either something with slices of data that has moved (the source code moved and the appender didn't dup it), or something like a previous issue in DCD with a splitting list data structure that was fundamentally flawed somehow, but had such a big threshold that it was never observed in smaller files.

@WebFreak001
Member Author

WebFreak001 commented Jun 17, 2021

I have tried the zip here and a complete phobos checkout using dub -- -S phobos with ~master D-Scanner and ~master libdparse; however, I could not reproduce a crash on my machine.

Also tried compiling D-Scanner with LDC in a release build.

Also tried make.

Got the issue now; I had forgotten the --config parameter.

WebFreak001 deleted the whitespace-and-comment-tokens branch on June 17, 2021 16:53
@WebFreak001
Member Author

WebFreak001 commented Jun 17, 2021

OK, so the strings of the trivia parts are all fine in memory; it is the combined cached string that gets overwritten by the GC (in my case, std.regex was doing some allocation and the GC gave it part of that memory), which causes the error.

When the comment (string) is first assigned in parseFunctionDeclaration in parser.d, the content is correct; later, however, the same string data pointer is no longer correct and causes the UTFException.

To me this seems like the GC either loses track of this string (thinking it's not allocated) or it's wrongly allocated.

[screenshot: the GC overwrote the comment content]

@WebFreak001
Member Author

I found one issue by accident just now:

The comment bytes are cast to string without interning (because of the config.commentBehavior = CommentBehavior.noIntern)

Maybe we shouldn't override what the user provides? Although this should have happened before as well; it seems the increased memory usage might now be the reason the code gets freed when it wasn't before.

This is not a fix for the issue, but rather prevents problems from appearing when trying to access comments later. It seems to have made the original issue trigger even earlier (on an earlier token) during analysis, indicating that how much memory is allocated might be relevant.

The weird thing is that it is the string returned by appender!string that gets corrupted, which shouldn't happen?

Fun fact:
calling GC.addRoot(&ret.data[0]); on the appender data causes the issue to go away.
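For context, a minimal sketch of that GC.addRoot workaround. The surrounding function is a placeholder, not the actual libdparse code; note that it pins the buffer permanently, so it leaks:

```d
// Minimal sketch of the GC.addRoot workaround mentioned above; buildComment
// is a placeholder, not the actual libdparse implementation.
import core.memory : GC;
import std.array : appender;

string buildComment(const(string)[] parts)
{
    auto ret = appender!string();
    foreach (part; parts)
        ret.put(part);

    auto result = ret.data;
    if (result.length)
        GC.addRoot(&result[0]); // pin the buffer; never released here, so it leaks
    return result;
}
```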

@WebFreak001
Member Author

Maybe the comment generation should use the normal StringCache, and the comments should be generated up front instead of being computed lazily when needed?

@Geod24
Member

Geod24 commented Jun 26, 2021

calling GC.addRoot(&ret.data[0]); on the appender data causes the issue to go away.

Could it be that the pointer for the data is not aligned, causing the GC to miss it?

@WebFreak001
Member Author

The data is allocated by appender, so the address should be aligned.

The string itself, holding the pointer to that data, is inside a Token structure, which doesn't do any special alignment. Here are the definitions, however:

mixin (extraFields);

string memoizedLeadingComment = null;

These tokens are themselves in an array, allocated by an appender in

auto leadingTriviaAppender = appender!(Token[])();

I don't know; does appender break the alignment? Otherwise it would all be aligned properly.

So the data is stored like this:

  • list of tokens: Token[] (appender allocated)
    • list of trivia per token: Token[] (appender allocated)
      • leading/trailing comment: string (interned, not corrupted on building)
    • memoized combined & processed leading/trailing comment: string (appender allocated, first not corrupted - later corrupted)
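As a simplified sketch of that layout (field names follow the gdb dump earlier in the thread; all other fields are omitted):

```d
// Simplified sketch of the layout described above; field names follow the
// gdb dump earlier in the thread, and all other fields are omitted.
struct Token
{
    string text;                    // interned token text (fine in memory)
    Token[] leadingTrivia;          // whitespace/comment tokens before this token
    Token[] trailingTrivia;         // whitespace/comment tokens after this token
    string memoizedLeadingComment;  // built lazily via appender (the one that corrupts)
    string memoizedTrailingComment;
}
```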

@RazvanN7
Contributor

RazvanN7 commented Jul 5, 2021

Until we get to the bottom of this, would it make sense to revert this PR and update Dscanner so that the phobos AliasAssign PRs get unblocked?

@WebFreak001
Member Author

I think it might be better to leave this API in, as it has been in for three releases now already.

As a workaround it might be possible to call internString on the appender-allocated string to make it actually permanent in memory.
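A hedged sketch of that workaround, assuming a StringCache is reachable where the memoized comment is built (the function is a placeholder, not the actual libdparse code):

```d
// Hedged sketch of the interning workaround suggested above; buildMemoizedComment
// is a placeholder, and it assumes a StringCache is available at that point.
import dparse.lexer;
import std.array : appender;

string buildMemoizedComment(const(string)[] parts, ref StringCache cache)
{
    auto ret = appender!string();
    foreach (part; parts)
        ret.put(part);
    // Copy the appender-backed buffer into the string cache so the result is
    // owned by memory the GC will not hand out again.
    return cache.intern(ret.data);
}
```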

@RazvanN7
Contributor

RazvanN7 commented Jul 8, 2021

Would setting config.commentBehavior to CommentBehavior.intern work around the issue?

@WebFreak001
Member Author

WebFreak001 commented Jul 8, 2021

The comments of the tokens are not getting corrupted; they stay fine since they are already interned (I also ran the tests using this option). The thing that gets corrupted regardless is the memoized comment/trailingComment in https://github.com/dlang-community/libdparse/pull/387/files#diff-073a681f37b7ac58a75cb15be50882a5075a2ebcaa53efc4bd97cffa037ab6ddR146, which is created using an appender.

When the string is assigned to another variable in parseFunctionDeclaration in parser.d, it's still valid; however, a little later it gets corrupted.

@CyberShadow
Member

This commit introduced a regression manifesting in D-Scanner:

Fix: #445
